
feat(ci): add CI-Ready GPU Brev launchable for E2E tests#1456

Open
ksapru wants to merge 9 commits into NVIDIA:main from ksapru:feat/gpu-ci-launchable

Conversation

@ksapru
Contributor

@ksapru ksapru commented Apr 3, 2026

Summary

This PR introduces a CI-Ready GPU Brev launchable (scripts/brev-launchable-ci-gpu.sh), mirroring the architecture of the CPU launchable to avoid repeating expensive setup in our GPU E2E tests. By pre-baking Docker, Node.js, the OpenShell CLI, and the NVIDIA Container Toolkit, and pre-pulling the Ollama models (qwen3:0.6b / qwen2.5:0.5b), it eliminates 5-10 minutes of bootstrapping per CI run. The PR also mitigates CI flakiness by bumping default Vitest timeouts from 5s to 30s on heavy integration tests so they run reliably on VM instances.

Related Issue

Fixes #1328

Changes

  • New GPU Startup Script: Added scripts/brev-launchable-ci-gpu.sh with robust GPU passthrough validation, graceful apt-lock handling, and reliable asynchronous Ollama polling defaults.
  • GPU Test Suite Enablement: Updated test/e2e/brev-e2e.test.js to run the GPU E2E suite conditionally when TEST_SUITE === "gpu".
  • Workflow Pipeline: Modified .github/workflows/nightly-e2e.yaml to integrate the GPU-specific launch pipeline path.
  • Test Timeout Fixes: Increased Vitest timeouts for test/install-preflight.test.js, test/nemoclaw-cli-recovery.test.js, and test/onboard.test.js from 5,000ms to 30,000ms to eliminate arbitrary timeout failures during heavy local and remote executions.
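The "graceful apt-lock handling" mentioned above can be sketched as a bounded poll on the dpkg frontend lock, so apt-get waits for unattended-upgrades instead of failing. This is a sketch only: the lock path is the Ubuntu default, while the function name and timeout are illustrative, not the script's actual code.

```shell
#!/usr/bin/env bash
# Sketch: wait for the dpkg frontend lock with a bounded timeout.
# Function name and timeout are assumptions, not the PR's actual code.
set -euo pipefail

wait_for_apt_lock() {
  local max_wait=${1:-300} waited=0
  local lock=/var/lib/dpkg/lock-frontend
  # fuser exits non-zero (or is absent) when nothing holds the lock
  while fuser "$lock" >/dev/null 2>&1; do
    if (( waited >= max_wait )); then
      echo "apt lock still held after ${max_wait}s" >&2
      return 1
    fi
    sleep 5
    waited=$(( waited + 5 ))
  done
  return 0
}

wait_for_apt_lock 30 && echo "apt lock free"
```

Called before each apt-get invocation, this turns a hard "Could not get lock" failure into a short, bounded wait.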

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes.

Doc Changes

  • Follows the style guide.
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Signed-off-by: Krish Sapru ksapru@bu.edu

Summary by CodeRabbit

  • New Features
    • Added a GPU end-to-end test option and an optional "launchable" CI flow for running it.
  • Chores
    • Introduced a CI startup script to prepare GPU test environments and dependencies.
    • Switched CI job execution to use ephemeral GPU instances.
  • Bug Fixes / Reliability
    • Expanded failure-log collection and artifact uploads for remote test runs.
  • Documentation
    • Updated e2e test docs with new env vars and TEST_SUITE values.

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Migrates GPU E2E from a self-hosted runner to ephemeral Brev t4 instances: CI provisions a Brev GPU VM, waits for a readiness sentinel, runs remote tests via brev exec, fetches logs via brev scp, and always deletes the instance. Adds a startup script that installs Docker, NVIDIA toolkit, Node.js, and Ollama.

Changes

CI Workflow GPU Job (.github/workflows/nightly-e2e.yaml)
Replaces the self-hosted GPU runner with an ubuntu-latest job that creates a Brev t4 instance (requires BREV_API_TOKEN), polls for /var/run/nemoclaw-launchable-ready, runs tests via brev exec with selected NEMOCLAW_* env vars, collects remote logs via brev scp, and always deletes the Brev instance on teardown.

GPU Instance Provisioning Script (scripts/brev-launchable-ci-gpu.sh)
Adds a new startup script (strict mode) that installs system packages, Docker, Node.js 22, and the OpenShell CLI; configures the NVIDIA Container Toolkit if a GPU is present; installs Ollama and pulls a model; optionally pre-pulls Docker images; builds repo dependencies; validates GPU passthrough; and writes the readiness sentinel.

E2E Test Docs / Config (test/e2e/brev-e2e.test.js)
Updates the test-suite documentation to expose additional optional env vars, and adds gpu as a selectable TEST_SUITE value plus a USE_LAUNCHABLE toggle (default "1").

Sequence Diagram

sequenceDiagram
    participant CI as CI Runner (ubuntu-latest)
    participant Brev as Brev CLI
    participant Instance as GPU Instance (ephemeral t4)
    participant Script as Startup Script

    CI->>Brev: install & authenticate Brev CLI
    CI->>Brev: brev create --script `scripts/brev-launchable-ci-gpu.sh`
    Brev->>Instance: provision t4 VM and run startup script
    Script->>Instance: install Docker, Node.js, NVIDIA toolkit, Ollama
    Script->>Instance: pull model, validate GPU (nvidia-smi)
    Script->>Instance: create `/var/run/nemoclaw-launchable-ready`

    loop Poll readiness
        CI->>Instance: check `/var/run/nemoclaw-launchable-ready`
        Instance-->>CI: ready (when created)
    end

    CI->>Brev: brev exec (run e2e tests, env vars)
    Brev->>Instance: execute tests
    Instance-->>Brev: test output & logs
    CI->>Brev: brev scp (copy /tmp install & test logs)
    CI->>Brev: brev delete --force || true

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
In clouds of t4 I hop and cheer,
I spin up GPUs and bring them near,
Ollama hums, Docker purrs sincere,
Tests leap forward, logs appear —
Ready sentinel: "Hop on, dear!" 🥕

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check: ❓ Inconclusive. The PR modifies test timeouts (install-preflight.test.js, nemoclaw-cli-recovery.test.js, onboard.test.js), which are not listed in the linked issue #1328 requirements but appear justified for reducing test flakiness per the PR summary. Resolution: clarify whether the timeout increases are in scope for issue #1328 or separate improvements, and document the rationale if intentional but out of scope.
✅ Passed checks (3 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Linked Issues check: ✅ Passed. The PR fully addresses issue #1328 requirements: a new GPU startup script (scripts/brev-launchable-ci-gpu.sh) with the NVIDIA Container Toolkit, Ollama pre-installed, the test model pre-pulled, and integration into the nightly-e2e.yaml workflow.
Title check: ✅ Passed. The title accurately describes the primary change: adding a CI-ready GPU Brev launchable (scripts/brev-launchable-ci-gpu.sh) and integrating it into the E2E test workflow.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

224-248: ⚠️ Potential issue | 🟠 Major

Delete the VM after failure logs are copied.

On failures, this always() step runs before both brev scp steps. That leaves nothing to copy, so the artifact uploads are empty right when diagnostics are needed most. Move teardown below the log-copy steps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 224 - 248, The teardown step
"Tear down GPU instance" currently runs before the failure log copy steps,
causing logs to be deleted; move the "Tear down GPU instance" step (the block
that runs `brev delete e2e-gpu-nightly-${{ github.run_id }}` with if: always())
to after the "Copy install log on failure", "Upload install log on failure", and
"Copy test log on failure" steps so that the `brev scp` and upload actions can
run first and collect artifacts, keeping the teardown step's if: always()
behavior but ensuring it executes last.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 181-185: The "Install Brev CLI" step currently extracts the
archive directly to /usr/local/bin (and sets chmod) which can fail due to
permissions; change the step to either use sudo when writing to /usr/local/bin
or unpack into a writable user directory (e.g., $HOME/.local/bin) and ensure
that directory is created and added to PATH before subsequent steps; update the
step that references brev so it uses the updated install location (references:
the "Install Brev CLI" step and the target path /usr/local/bin or
$HOME/.local/bin and the brev binary name).
- Around line 193-195: The --startup-script argument passed to the brev create
command currently uses a remote URL which Brev CLI no longer accepts; update the
command that builds the instance (the brev create invocation with --name
"$INSTANCE_NAME" and --flavor "t4") to use the `@filepath` form by replacing the
URL value with `@scripts/brev-launchable-ci-gpu.sh` so the CLI reads the local
script file content instead of a URL.
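Taken together, the two workflow fixes above might look like the following fragment. This is a sketch only: the step names, the archive filename (brev.tar.gz), and the exact flags are assumptions based on the review comments, not copied from the actual workflow.

```yaml
# Sketch only: step names, archive name, and flags are assumed.
- name: Install Brev CLI
  run: |
    mkdir -p "$HOME/.local/bin"                # user-writable, no sudo needed
    tar -xzf brev.tar.gz -C "$HOME/.local/bin" brev
    chmod +x "$HOME/.local/bin/brev"
    echo "$HOME/.local/bin" >> "$GITHUB_PATH"  # visible to later steps

- name: Create GPU instance
  run: |
    brev create --name "$INSTANCE_NAME" --flavor t4 \
      --startup-script @scripts/brev-launchable-ci-gpu.sh  # local file, not a URL
```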

In `@scripts/brev-launchable-ci-gpu.sh`:
- Around line 295-300: The until loop polling Ollama can hang indefinitely; add
a bounded timeout (e.g., MAX_WAIT_SECS) and check elapsed time or a counter
inside the loop that waits for http://localhost:11434, and if the timeout is
reached, kill the background process (use OLLAMA_PID), log a clear error, and
exit non‑zero so the CI fails fast; update the block that starts "ollama serve
>/dev/null 2>&1 &" and the subsequent loop that uses curl to implement this
timeout and cleanup.
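A bounded version of that polling loop might look like the sketch below. The function name, timeout, and interval values are illustrative; the real script would probe http://localhost:11434 after starting `ollama serve`, shown here only as comments so the sketch stays self-contained.

```shell
#!/usr/bin/env bash
# Sketch: bounded readiness poll that fails fast instead of hanging.
# Names and values are assumptions, not the script's actual code.
set -euo pipefail

wait_for_ready() {
  local probe_cmd=$1 max_wait=${2:-180} interval=${3:-2} elapsed=0
  until eval "$probe_cmd" >/dev/null 2>&1; do
    if (( elapsed >= max_wait )); then
      echo "ERROR: not ready after ${max_wait}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "ready after ${elapsed}s"
}

# In the launchable this would wrap the Ollama probe, e.g.:
#   ollama serve >/dev/null 2>&1 & OLLAMA_PID=$!
#   wait_for_ready "curl -fsS http://localhost:11434" 180 2 \
#     || { kill "$OLLAMA_PID" 2>/dev/null || true; exit 1; }
wait_for_ready "true" 5 1
```

Because the function returns non-zero on timeout, the caller can kill the background process and exit, so CI fails fast rather than hanging until the job-level timeout.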
- Around line 243-271: The GPU validation currently only logs warnings when
nvidia-smi is missing or the docker GPU test fails, allowing the startup to
continue; change the validation in the block that runs nvidia-smi and the docker
run to treat failures as fatal: when nvidia-smi is not found or returns
non-zero, or when docker run --gpus all nvidia/cuda:12.2.0-base nvidia-smi
fails, call an error handler (e.g., use error or fatal logging) and exit
non-zero (avoid creating the ready sentinel) instead of warn; update the code
around the nvidia-smi checks and the docker run command to exit with non-zero
status on failure so a broken GPU runtime is not marked ready.
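The fail-fast validation described above can be sketched with small error helpers. `fatal` and `require_cmd` are assumed helper names; the nvidia-smi and docker probes mirror the review comment and are left as comments so the sketch runs on any machine.

```shell
#!/usr/bin/env bash
# Sketch: treat GPU validation failures as fatal so a broken runtime
# never writes the ready sentinel. Helper names are assumptions.
set -euo pipefail

fatal() { echo "FATAL: $*" >&2; exit 1; }

require_cmd() {
  command -v "$1" >/dev/null 2>&1 || fatal "$1 not found on PATH"
}

# On a real GPU instance, before touching the ready sentinel:
#   require_cmd nvidia-smi
#   nvidia-smi || fatal "nvidia-smi returned non-zero"
#   docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi \
#     || fatal "GPU passthrough probe failed"
require_cmd sh
echo "validation helpers loaded"
```

Exiting non-zero before the sentinel is created means the CI readiness poll times out instead of running tests against a broken GPU runtime.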

In `@test/e2e/brev-e2e.test.js`:
- Around line 648-656: The test's GPU branch is never exercising GPU because
beforeAll always uses the CPU bootstrap (scripts/brev-launchable-ci-cpu.sh /
brev search cpu) and later asserts gpuEnabled: false; update the setup in the
beforeAll (or the bootstrap helper invoked there) to conditionally request a GPU
when TEST_SUITE === "gpu" — e.g., switch to the GPU launch script or a brev
search that returns a GPU launchable and ensure the resulting runtime's
gpuEnabled flag is true (references: TEST_SUITE, beforeAll,
scripts/brev-launchable-ci-cpu.sh, brev search cpu, gpuEnabled) so the "GPU E2E
suite passes on remote VM" test actually runs on a GPU environment.

---

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 224-248: The teardown step "Tear down GPU instance" currently runs
before the failure log copy steps, causing logs to be deleted; move the "Tear
down GPU instance" step (the block that runs `brev delete e2e-gpu-nightly-${{
github.run_id }}` with if: always()) to after the "Copy install log on failure",
"Upload install log on failure", and "Copy test log on failure" steps so that
the `brev scp` and upload actions can run first and collect artifacts, keeping
the teardown step's if: always() behavior but ensuring it executes last.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7119745b-c573-4f21-9c3e-b69f28d691f5

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01cf and e6994ae.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/brev-launchable-ci-gpu.sh
  • test/e2e/brev-e2e.test.js

@ksapru ksapru force-pushed the feat/gpu-ci-launchable branch from b7eef04 to 3095400 Compare April 4, 2026 16:00
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

224-249: ⚠️ Potential issue | 🔴 Critical

Critical: Teardown runs before log collection, preventing artifact retrieval.

The "Tear down GPU instance" step (line 224-227) uses if: always() and runs before the log copy steps (lines 228-234, 243-249). When tests fail:

  1. Teardown executes and deletes the instance
  2. Log copy steps attempt brev scp on a deleted instance
  3. Logs are lost despite || true preventing step failure

The log copy steps must execute before the instance is deleted.

🐛 Fix: Reorder steps so log collection precedes teardown
          brev exec "$INSTANCE_NAME" -- bash -c \
            "cd ~/NemoClaw && \
             export NEMOCLAW_NON_INTERACTIVE=${NEMOCLAW_NON_INTERACTIVE} && \
             export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=${NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE} && \
             export NEMOCLAW_SANDBOX_NAME=${NEMOCLAW_SANDBOX_NAME} && \
             export NEMOCLAW_RECREATE_SANDBOX=${NEMOCLAW_RECREATE_SANDBOX} && \
             export NEMOCLAW_PROVIDER=${NEMOCLAW_PROVIDER} && \
             export OLLAMA_MODEL=qwen3:0.6b && \
             bash test/e2e/test-gpu-e2e.sh"

-     - name: Tear down GPU instance
-       if: always()
-       run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
-
      - name: Copy install log on failure
        if: failure()
        env:
          INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
        run: |
          brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-install.log /tmp/nemoclaw-gpu-e2e-install.log || true

      - name: Upload install log on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: gpu-e2e-install-log
          path: /tmp/nemoclaw-gpu-e2e-install.log
          if-no-files-found: ignore

      - name: Copy test log on failure
        if: failure()
        env:
          INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
        run: |
          brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-test.log /tmp/nemoclaw-gpu-e2e-test.log || true

      - name: Upload test log on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: gpu-e2e-test-log
          path: /tmp/nemoclaw-gpu-e2e-test.log
          if-no-files-found: ignore
+
+     - name: Tear down GPU instance
+       if: always()
+       run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 224 - 249, The teardown step
"Tear down GPU instance" currently runs before the log collection steps and
deletes the instance too early; move the "Tear down GPU instance" step to after
the "Copy install log on failure", "Upload install log on failure", and "Copy
test log on failure" steps so that brev scp can copy logs before deletion,
keeping its if: always() so it still runs on success/failure but ensuring its
position is last; verify the ENV INSTANCE_NAME usage in the copy steps remains
unchanged and preserve the upload step "Upload install log on failure" between
the copy and teardown steps.
🧹 Nitpick comments (1)
.github/workflows/nightly-e2e.yaml (1)

199-210: Polling timeout may be tight for GPU provisioning.

The polling loop waits a maximum of 10 minutes (20 iterations × 30s) for the instance to be ready. GPU instance provisioning plus the startup script execution (which installs Docker, NVIDIA Container Toolkit, and pre-pulls Ollama models) may exceed this window, especially during peak demand or cold starts.

Consider increasing the iteration count or adding visibility into why readiness failed:

♻️ Suggested improvement
          echo "Waiting for readiness sentinel..."
          export READY=0
-         for i in {1..20}; do
+         for i in {1..40}; do
+            echo "Poll attempt $i..."
             if brev exec "$INSTANCE_NAME" -- cat /var/run/nemoclaw-launchable-ready >/dev/null 2>&1; then
               READY=1
               break
             fi
-            sleep 30
+            sleep 15
          done

          if [ $READY -eq 0 ]; then
             echo "Instance did not become ready in time."
+            echo "Attempting to retrieve startup logs for diagnostics..."
+            brev exec "$INSTANCE_NAME" -- tail -100 /var/log/cloud-init-output.log 2>/dev/null || true
             exit 1
          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 199 - 210, The current
polling loop uses for i in {1..20} with 30s sleeps and checks
/var/run/nemoclaw-launchable-ready via brev exec "$INSTANCE_NAME" and may time
out during GPU provisioning; increase the iteration count (e.g., to 40 or more)
or lengthen the sleep to allow more time, and add diagnostic logging when a
check fails (include the iteration number, output/stderr from brev exec
"$INSTANCE_NAME" and the absence of /var/run/nemoclaw-launchable-ready) so
failures are visible before exiting; ensure the READY variable logic remains
unchanged (set READY=1 on success and exit nonzero if still 0 after the loop).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 224-249: The teardown step "Tear down GPU instance" currently runs
before the log collection steps and deletes the instance too early; move the
"Tear down GPU instance" step to after the "Copy install log on failure",
"Upload install log on failure", and "Copy test log on failure" steps so that
brev scp can copy logs before deletion, keeping its if: always() so it still
runs on success/failure but ensuring its position is last; verify the ENV
INSTANCE_NAME usage in the copy steps remains unchanged and preserve the upload
step "Upload install log on failure" between the copy and teardown steps.

---

Nitpick comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 199-210: The current polling loop uses for i in {1..20} with 30s
sleeps and checks /var/run/nemoclaw-launchable-ready via brev exec
"$INSTANCE_NAME" and may time out during GPU provisioning; increase the
iteration count (e.g., to 40 or more) or lengthen the sleep to allow more time,
and add diagnostic logging when a check fails (include the iteration number,
output/stderr from brev exec "$INSTANCE_NAME" and the absence of
/var/run/nemoclaw-launchable-ready) so failures are visible before exiting;
ensure the READY variable logic remains unchanged (set READY=1 on success and
exit nonzero if still 0 after the loop).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2445429-0f1d-49a3-abc3-b9e5e2a0b0dd

📥 Commits

Reviewing files that changed from the base of the PR and between b7eef04 and f170256.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/brev-launchable-ci-gpu.sh
  • test/e2e/brev-e2e.test.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e/brev-e2e.test.js
  • scripts/brev-launchable-ci-gpu.sh

@wscurran added labels on Apr 4, 2026: Platform: Brev (support for Brev deployment), CI/CD (NemoClaw CI/CD pipeline or GitHub Actions), Local Models (running NemoClaw with local models), and enhancement: testing (improve NemoClaw test coverage).
@wscurran
Contributor

wscurran commented Apr 4, 2026

✨ Thanks for submitting this pull request, which proposes a way to improve CI reliability by adding a GPU-ready Brev launchable and increasing test timeouts for smoother E2E execution.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

288-312: ⚠️ Potential issue | 🔴 Critical

Step ordering bug: Teardown runs before log retrieval.

The "Tear down GPU instance" step uses if: always() and is positioned before the log copy steps. GitHub Actions executes steps sequentially, so:

  1. Teardown deletes the instance
  2. Copy log steps attempt brev scp on a non-existent instance

Both brev scp commands fail silently (|| true), resulting in no logs being uploaded on failure.

Proposed fix: Move teardown after log retrieval
-      - name: Tear down GPU instance
-        if: always()
-        run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
-
       - name: Copy install log on failure
         if: failure()
         env:
           INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
         run: |
           brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-install.log /tmp/nemoclaw-gpu-e2e-install.log || true

       - name: Upload install log on failure
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: gpu-e2e-install-log
           path: /tmp/nemoclaw-gpu-e2e-install.log
           if-no-files-found: ignore

       - name: Copy test log on failure
         if: failure()
         env:
           INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
         run: |
           brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-test.log /tmp/nemoclaw-gpu-e2e-test.log || true

       - name: Upload test log on failure
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: gpu-e2e-test-log
           path: /tmp/nemoclaw-gpu-e2e-test.log
           if-no-files-found: ignore

+      - name: Tear down GPU instance
+        if: always()
+        run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 288 - 312, The "Tear down
GPU instance" step currently runs with if: always() before the log retrieval
steps, causing brev scp to fail against a deleted instance; move the entire
"Tear down GPU instance" step (the step with name "Tear down GPU instance" that
runs brev delete e2e-gpu-nightly-${{ github.run_id }} || true) to after both
"Copy install log on failure", "Upload install log on failure" and "Copy test
log on failure" steps so logs are copied/uploaded first, and keep its if:
always() so teardown still runs regardless of outcome.
🧹 Nitpick comments (1)
.github/workflows/nightly-e2e.yaml (1)

288-290: Consider using INSTANCE_NAME env var for consistency.

The teardown step hardcodes the instance name pattern while other steps use the INSTANCE_NAME env var. For maintainability, consider adding the env block:

Suggested change
       - name: Tear down GPU instance
         if: always()
+        env:
+          INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
-        run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
+        run: brev delete "$INSTANCE_NAME" || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 288 - 290, The teardown step
"Tear down GPU instance" hardcodes the instance name pattern instead of using
the shared INSTANCE_NAME env var; update that job step to reference the
INSTANCE_NAME environment variable (same symbol used in other steps) in the run
command and, if necessary, add an env: block to the step so it uses
INSTANCE_NAME rather than the literal e2e-gpu-nightly-${{ github.run_id }}
string to keep naming consistent and maintainable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 278-286: The exported environment variable is wrong: replace the
export of OLLAMA_MODEL with NEMOCLAW_MODEL so the test script
test/e2e/test-gpu-e2e.sh receives the intended model; update the command that
currently sets "export OLLAMA_MODEL=qwen3:0.6b" to "export
NEMOCLAW_MODEL=qwen3:0.6b" ensuring the pre-pulled qwen3:0.6b value is passed
into the test run (check the bash snippet that builds the brev exec command
where OLLAMA_MODEL is referenced).

---

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 288-312: The "Tear down GPU instance" step currently runs with if:
always() before the log retrieval steps, causing brev scp to fail against a
deleted instance; move the entire "Tear down GPU instance" step (the step with
name "Tear down GPU instance" that runs brev delete e2e-gpu-nightly-${{
github.run_id }} || true) to after both "Copy install log on failure", "Upload
install log on failure" and "Copy test log on failure" steps so logs are
copied/uploaded first, and keep its if: always() so teardown still runs
regardless of outcome.

---

Nitpick comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 288-290: The teardown step "Tear down GPU instance" hardcodes the
instance name pattern instead of using the shared INSTANCE_NAME env var; update
that job step to reference the INSTANCE_NAME environment variable (same symbol
used in other steps) in the run command and, if necessary, add an env: block to
the step so it uses INSTANCE_NAME rather than the literal e2e-gpu-nightly-${{
github.run_id }} string to keep naming consistent and maintainable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e09c73ad-c500-4fc7-a71c-cc1273393282

📥 Commits

Reviewing files that changed from the base of the PR and between f170256 and e7fccac.

📒 Files selected for processing (1)
  • .github/workflows/nightly-e2e.yaml

@ksapru ksapru force-pushed the feat/gpu-ci-launchable branch from e7fccac to 65bf6f5 Compare April 6, 2026 17:16
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

288-312: ⚠️ Potential issue | 🟠 Major

Step ordering bug: logs cannot be copied after instance deletion.

The "Tear down GPU instance" step runs with if: always() at line 289, but the "Copy install log on failure" (lines 292-297) and "Copy test log on failure" (lines 307-312) steps run after teardown. When a test fails, the instance is deleted before logs can be copied via brev scp, so the artifact uploads will always be empty.

Move teardown to be the final step, or move log copy steps before teardown.

Proposed fix — move teardown after log collection
-      - name: Tear down GPU instance
-        if: always()
-        run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
-
       - name: Copy install log on failure
         if: failure()
         env:
           INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
         run: |
           brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-install.log /tmp/nemoclaw-gpu-e2e-install.log || true

       - name: Upload install log on failure
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: gpu-e2e-install-log
           path: /tmp/nemoclaw-gpu-e2e-install.log
           if-no-files-found: ignore

       - name: Copy test log on failure
         if: failure()
         env:
           INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
         run: |
           brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-test.log /tmp/nemoclaw-gpu-e2e-test.log || true

       - name: Upload test log on failure
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: gpu-e2e-test-log
           path: /tmp/nemoclaw-gpu-e2e-test.log
           if-no-files-found: ignore

+      - name: Tear down GPU instance
+        if: always()
+        run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 288 - 312, The teardown step
"Tear down GPU instance" is running before log collection so brev scp cannot
access the instance; move the "Tear down GPU instance" step to after the "Copy
install log on failure", "Upload install log on failure", and "Copy test log on
failure" steps (i.e., make those three log-copy/upload steps run prior to
teardown) and keep the teardown's if: always() so it still runs regardless of
outcome; ensure the log copy steps retain if: failure() so artifacts are
collected only on failure and then teardown runs last to delete the instance.
🧹 Nitpick comments (2)
scripts/brev-launchable-ci-gpu.sh (1)

211-238: Consider using docker directly instead of sg docker.

Since line 141 already runs chmod 666 /var/run/docker.sock, the sg docker -c "..." wrapper is unnecessary—the socket is world-accessible. Using docker pull directly would be simpler and avoid potential issues if the docker group wasn't properly added to the user.

Proposed simplification
   for image in "${DOCKER_IMAGES[@]}"; do
     info "  Pulling $image..."
-    sg docker -c "docker pull $image" 2>&1 | tail -1 \
+    docker pull "$image" 2>&1 | tail -1 \
       || warn "  Failed to pull $image (will be pulled at test time)"
   done
   ...
-  if ! sg docker -c "docker pull $CLUSTER_IMAGE" 2>&1 | tail -1; then
+  if ! docker pull "$CLUSTER_IMAGE" 2>&1 | tail -1; then
     warn "  Could not pull $CLUSTER_IMAGE — trying :latest"
-    sg docker -c "docker pull ghcr.io/nvidia/openshell/cluster:latest" 2>&1 | tail -1 \
+    docker pull "ghcr.io/nvidia/openshell/cluster:latest" 2>&1 | tail -1 \
       || warn "  Failed to pull openshell/cluster (will be pulled at test time)"
   fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `scripts/brev-launchable-ci-gpu.sh` around lines 211 - 238: the pre-pull section wraps every pull in `sg docker -c "docker pull ..."` (inside the `SKIP_DOCKER_PULL` guard looping over `DOCKER_IMAGES`, and when pulling `CLUSTER_IMAGE`/`openshell/cluster` with `CLUSTER_TAG`). Since `/var/run/docker.sock` is already `chmod 666`, the `sg` wrapper is unnecessary; replace those invocations with direct `docker pull` commands (e.g., `docker pull "$image"`, `docker pull "$CLUSTER_IMAGE"`, `docker pull ghcr.io/nvidia/openshell/cluster:latest`), keeping the existing `2>&1 | tail -1` piping and the `warn` fallbacks and messages.
.github/workflows/nightly-e2e.yaml (1)

251-274: Readiness polling timeout may be too short for cold provisioning.

The loop polls 20 times with 30-second sleeps (10 minutes total). The startup script installs Docker and the NVIDIA Container Toolkit, pulls Docker images, installs Ollama, and pulls models; on a fresh VM this can exceed 10 minutes. Consider increasing to 30 iterations (15 minutes) or adding a check for startup-script failure.

Proposed fix
           echo "Waiting for readiness sentinel..."
           export READY=0
-          for i in {1..20}; do
+          for i in {1..30}; do
              if brev exec "$INSTANCE_NAME" -- cat /var/run/nemoclaw-launchable-ready >/dev/null 2>&1; then
                READY=1
                break
              fi
              sleep 30
           done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `.github/workflows/nightly-e2e.yaml` around lines 251 - 274: the readiness loop (`for i in {1..20}` with 30s sleeps, checking via `brev exec "$INSTANCE_NAME" -- cat /var/run/nemoclaw-launchable-ready`) can time out during cold provisioning. Increase the iteration count (e.g., `for i in {1..30}`) to extend the wait to about 15 minutes, and/or add a secondary check after `brev create` that detects startup-script errors (inspect brev logs or run a brief `brev exec` to check the startup script's exit status) so the workflow fails fast with actionable logs when VM provisioning or the startup script fails.
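
The suggested longer wait can be factored into a small polling helper. A standalone sketch (the `brev exec` check is swapped for a local temp file so the pattern runs anywhere; in the workflow the check command would be `brev exec "$INSTANCE_NAME" -- cat /var/run/nemoclaw-launchable-ready`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll `check_cmd` up to `max_tries` times, sleeping `sleep_secs` between tries.
wait_for_sentinel() {
  local check_cmd=$1 max_tries=${2:-30} sleep_secs=${3:-1} i
  for ((i = 1; i <= max_tries; i++)); do
    if eval "$check_cmd" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$sleep_secs"
  done
  return 1
}

sentinel=$(mktemp -u)               # path that does not exist yet
( sleep 2; touch "$sentinel" ) &    # simulate the startup script finishing late

if wait_for_sentinel "cat '$sentinel'" 30 1; then
  READY=1
else
  READY=0
fi
echo "READY=$READY"

wait                                # reap the background helper
rm -f "$sentinel"
```

Factoring the loop this way also makes the timeout (`max_tries`, `sleep_secs`) explicit, so extending the wait is a one-argument change rather than an edit to the loop body.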
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/brev-launchable-ci-gpu.sh`:
- Around line 41-59: The OPENSHELL_VERSION default is too old (v0.0.20) and must
be bumped to the minimum supported v0.0.22; update the OPENSHELL_VERSION
variable assignment in the script (the
OPENSHELL_VERSION="${OPENSHELL_VERSION:-v0.0.20}" line) to use v0.0.22 so it
matches requirements in scripts/install-openshell.sh and avoids sandbox
persistence failures during E2E tests.
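
The fix itself is a one-line default bump. As a sketch, the `:-` parameter expansion keeps the variable overridable from the environment while raising the fallback:

```shell
unset OPENSHELL_VERSION   # standalone demo: start with no override set

# Hypothetical default bump: an exported OPENSHELL_VERSION still wins, but the
# fallback now meets the v0.0.22 minimum from scripts/install-openshell.sh.
OPENSHELL_VERSION="${OPENSHELL_VERSION:-v0.0.22}"
echo "OPENSHELL_VERSION=$OPENSHELL_VERSION"
```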

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 295481b2-0469-461e-ba0d-658198ef1f9a

📥 Commits

Reviewing files that changed from the base of the PR and between e7fccac and d053d0f.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/brev-launchable-ci-gpu.sh
  • test/e2e/brev-e2e.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/brev-e2e.test.js

Comment thread scripts/brev-launchable-ci-gpu.sh
@wscurran
Contributor

Thanks for the GPU launchable work — pre-baking the NVIDIA container toolkit and Ollama models could meaningfully reduce E2E bootstrapping time. The codebase has changed since April 3, so this will need a rebase on origin/main before we can review it. Please rebase and resolve any conflicts, and we'll take a look.

@wscurran wscurran added the status: rebase PR needs to be rebased against main before review can continue label Apr 14, 2026
@cjagwani cjagwani self-assigned this Apr 15, 2026
@ksapru ksapru changed the title ci: add CI-Ready GPU Brev launchable for E2E tests feat(ci): add CI-Ready GPU Brev launchable for E2E tests Apr 15, 2026
@ksapru ksapru force-pushed the feat/gpu-ci-launchable branch from 18b51a2 to b20b60f on April 15, 2026 at 21:19
@ksapru
Contributor Author

ksapru commented Apr 15, 2026

Hey @wscurran, just rebased. cc: @jyaunches @cv

@cv cv requested a review from jyaunches April 16, 2026 22:37

Labels

- CI/CD: Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions.
- enhancement: testing: Use this label to identify requests to improve NemoClaw test coverage.
- Local Models: Running NemoClaw with local models
- Platform: Brev: Support for Brev deployment
- status: rebase: PR needs to be rebased against main before review can continue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci: create Flavor 2 Brev launchable — CI-Ready GPU

4 participants